Remove cpu limit for rayservice e2e test#4859

Open
AndySung320 wants to merge 5 commits into ray-project:master from AndySung320:remove-cpu-limit

@AndySung320
Contributor

Why are these changes needed?

Remove CPU resource limits from RayService e2e test specs.

Previously, a 500m CPU limit on the head pod caused dashboard startup timeouts and flaky tests (fixed in #4702 by raising the limit to 1).
However, CPU limits are unnecessary in this test environment and can still cause throttling under load. Removing them entirely eliminates this type of flakiness rather than tuning the limit value.
ref
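
For illustration, dropping the CPU limit from a container spec could look roughly like the fragment below. This is a sketch, not the PR's actual diff: the field paths follow the RayService CRD, while the container name and the resource values are assumptions.

```yaml
# Hypothetical head group fragment of a RayService manifest.
# The container name and all values are illustrative, not copied from this PR.
spec:
  rayClusterConfig:
    headGroupSpec:
      template:
        spec:
          containers:
            - name: ray-head
              resources:
                requests:
                  cpu: "1"
                  memory: 2G
                limits:
                  # No cpu entry: the container has no CPU quota,
                  # so it cannot be throttled by one under load.
                  memory: 2G
```

Keeping the memory limit while removing only the CPU limit is a common pattern, because CPU is compressible (a limit causes throttling) whereas exceeding a memory limit causes the container to be killed.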

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: AndySung320 <andysung0320@gmail.com>
@AndySung320 AndySung320 changed the title remove cpu limit for rayservice e2e test Remove cpu limit for rayservice e2e test May 22, 2026
Signed-off-by: AndySung320 <andysung0320@gmail.com>
Signed-off-by: AndySung320 <andysung0320@gmail.com>
Member

@andrewsykim andrewsykim left a comment

Overall this makes sense to me. I would not be surprised if CPU limits contribute to some level of flakiness in e2e tests.

    price: 2
  ray_actor_options:
    num_cpus: 0.1
- name: PearStand
Member

Why add PearStand?

Contributor Author

@AndySung320 AndySung320 May 22, 2026

According to the Ray Core documentation, an actor requires 1 CPU for scheduling by default when num_cpus is not specified.
PearStand is defined in the graph but was omitted from our serveConfigV2, so it received no custom ray_actor_options and Ray assigned it the default of 1 CPU.
Previously, this was masked because our head node had limits.cpu: 2, which made KubeRay pass --num-cpus=2 to Ray. Now that the limit is removed, KubeRay falls back to requests.cpu: 1. With only 1 CPU in total available to Ray, PearStand's default 1-CPU demand exceeded the budget and caused the scheduling failure.
Adding PearStand here with num_cpus=0.1 explicitly overrides Ray's 1-CPU default and aligns it with the other deployments.

See the controller log showing that PearStand failed to schedule with only 0.4 CPU available:
[Image: Screenshot 2026-05-22 at 1 17 31 PM]
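
As a rough sketch of the override described above, a per-deployment CPU request in serveConfigV2 might be written like this. Only the deployment name PearStand and the value 0.1 come from the discussion; the application name and import path are hypothetical placeholders.

```yaml
# Sketch of a serveConfigV2 block in a RayService spec.
# The application name and import_path below are assumptions for illustration.
serveConfigV2: |
  applications:
    - name: fruit_app
      import_path: fruit.deployment_graph
      deployments:
        - name: PearStand
          ray_actor_options:
            # Without this entry each replica requests the default of 1 CPU,
            # which does not fit when Ray is started with --num-cpus=1 and
            # other deployments already hold part of that budget.
            num_cpus: 0.1
```

Listing every deployment in the graph explicitly avoids relying on the default, so the total CPU demand stays predictable when the head pod's resources change.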

@AndySung320 AndySung320 marked this pull request as ready for review May 22, 2026 20:25
Signed-off-by: AndySung320 <andysung0320@gmail.com>
Signed-off-by: AndySung320 <andysung0320@gmail.com>
@AndySung320
Contributor Author

The worker CPU limit in rayservice.autoscaling.yaml is intentionally kept.
The autoscaling test verifies that worker pods scale up under load. Without a CPU limit, Ray may detect enough CPUs on a single worker to fit all replicas, preventing additional worker pods from being created and causing the test to fail.
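
To make the point above concrete, a worker group that keeps its CPU limit could look roughly like the following. This is a hedged sketch: the group name, replica bounds, and resource values are assumptions and are not taken from rayservice.autoscaling.yaml.

```yaml
# Hypothetical worker group fragment; names and values are illustrative only.
workerGroupSpecs:
  - groupName: small-group
    minReplicas: 1
    maxReplicas: 5
    template:
      spec:
        containers:
          - name: ray-worker
            resources:
              requests:
                cpu: "1"
              limits:
                # Kept on purpose: this caps the CPUs each worker advertises
                # to Ray, so extra replicas need extra worker pods.
                cpu: "1"
```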


@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 5a93c83. Configure here.

    cpu: "1"
    memory: 1G
  limits:
    cpu: "1"

Worker CPU limit not removed in autoscaling YAML

Medium Severity

The CPU limit (cpu: "500m") on the worker pod in rayservice.autoscaling.yaml was not removed, while CPU limits were consistently removed from both head and worker pods in all other RayService test YAML files (rayservice.static.yaml, rayservice.deletiondelay.yaml, ray-service.ft.yaml). This appears to be an oversight that leaves the autoscaling worker pod susceptible to the same CPU throttling-related flakiness this PR intends to eliminate.
